System and method for organizing data

ABSTRACT

A system and method for organizing raw data from one or more sources uses an improved mechanism for identifying duplicate data between fields (e.g., columns) in the databases. The fields may be similar fields within a single database or similar or identical fields within a pair of databases and are organized as arrays or field vectors. The present invention sorts each of the field vectors and, if necessary, partitions them by common value. The number of comparisons required to identify the duplicate data between the field vectors is reduced by feeding back a difference between the compared values. This difference is used to adjust indices into the field vectors for subsequent comparison.

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] The present application is a continuation application of co-pending application Ser. No. 09/412,970, entitled “System and Method for Organizing Data,” which was filed on Oct. 6, 1999, which in turn is a continuation-in-part application of application Ser. No. 09/357,301, entitled “System and Method for Organizing Data,” which was filed on Jul. 20, 1999.

BACKGROUND

[0002] 1. Field of the Invention

[0003] The present invention relates to database systems and more particularly, to a system and method for organizing data in a database system.

[0004] 2. Discussion of the Related Art

[0005] Computerized database systems have long been used and their basic concepts are well known. A good introduction to database systems may be found in C. J. DATE, INTRODUCTION TO DATABASE SYSTEMS (Addison Wesley, 6th ed. 1994).

[0006] In general, database systems are designed to organize, store and retrieve data in such a way that the data in the database is useful. For example, the data, or partitioned sets of the data, may be searched, sorted, organized and/or combined with other data. To a large extent, the usefulness of a particular database system is dependent on the integrity (i.e., the accuracy and/or correctness) of the data in the database system. Data integrity is affected by the degree of “disorder” in the data stored. Disorder may occur in the form of erroneous or incomplete data such as duplicate data, fragmented data, false data, etc. In many database systems, from time to time, existing data may be edited and processed, and as a result, additional errors may be introduced. In some database systems, new data may be introduced. Additionally, as database systems are upgraded with new hardware and/or software, data conversion may be required or additional fields may become necessary. Furthermore, in some applications, the data in the database may simply become outdated over time.

[0007] Regardless of the preventative steps taken, some degree of disorder is eventually introduced in conventional database systems. This degree of disorder increases exponentially over time until eventually, the data in a conventional database becomes entirely useless. As a result, even a small degree of disorder eventually affects the integrity of the database system.

[0008] Unfortunately, identifying and correcting disorder in the data are often difficult, if not impossible, tasks particularly in large database systems. Traditionally, such tasks are performed manually, making these tasks time-consuming, expensive, and subject to human error. Furthermore, due to the very nature of the task, much of the disorder may go largely undetected. What is needed is a system and method for organizing data in a database system to overcome these and other associated problems.

SUMMARY OF THE INVENTION

[0009] The present invention provides a system and method for organizing data in a database system. The present invention derives a distilled database of accurate data from raw data included in one or more raw data sources. The raw data is converted from its original format(s) to a numeric format. According to one embodiment of the present invention, the raw data is represented as a vector having numeric elements. Once the raw data is represented numerically, various mathematical operations such as correlation functions, pattern recognition methods, or other similar numeric methods, may be performed on these vectors to determine how content in a particular vector corresponds to other vectors in a “distilled” or reference database. The distilled database is formed from sets of one or more related vectors that are believed to be unique (e.g., orthogonal) with respect to the other sets. These sets represent the best information available from the raw data. After all the raw data has been incorporated into the distilled database, new data may be screened to ensure that new errors are not introduced into the distilled database. The new data may also be evaluated to determine whether it is unique or whether it includes better information than that already present in the distilled database. The new data is added to the distilled database accordingly.

[0010] One of the features of the present invention is that raw data is converted into a numeric format based on a number system having an appropriate radix. An appropriate radix is determined according to the type of information included in the raw data. For example, for raw data generally comprised of alpha-numeric characters, an appropriate radix may be greater than or equal to the number of different alpha-numeric characters present in the raw data. Using such a number system allows raw data to be represented numerically, allowing for manipulation through various well-known mathematical operations.

[0011] Another feature of the present invention is that the number system may be selected so that the numbers themselves retain semantic significance to the raw data they represent. In other words, the numerals in the number system are selected so that they correspond to the raw data. For example, in the case of raw data comprised of alphanumeric characters, the numerals are selected to correspond to the alphanumeric characters they represent. When the numerals in the number system are subsequently displayed, they appear as the alphanumeric characters they represent.

[0012] Another feature of the present invention is that once the raw data is represented as vectors in an appropriate number system, the represented data may be efficiently manipulated in the database (e.g., sorted, etc.) using various well-known techniques. Furthermore, various well-known mathematical operations may be performed on the vectors to analyze the data content. These mathematical operations may include correlation functions, eigenvector analyses, pattern recognition methods, and others as would be apparent.

[0013] Still another feature of the present invention is that the raw data is incorporated into a distilled database. The distilled database represents the best information extracted from the raw data without having any data disorder.

[0014] Yet another feature of the present invention is that new data may be compared to the distilled database to determine whether the new data actually includes any new information or content not already present in the distilled database. Any new information not already in the distilled database is added to the distilled database without adding any disorder. In this manner, the integrity of the distilled database may be maintained.

[0015] Other features and advantages of the invention will become apparent from the following drawings and description.

BRIEF DESCRIPTION OF THE DRAWINGS

[0016] The present invention is described with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Additionally, the left-most digit(s) of a reference number identifies the drawing in which the reference number first appears.

[0017] FIG. 1 illustrates a processing system in which the present invention may be implemented.

[0018] FIG. 2 illustrates stages of data processed by one embodiment of the present invention.

[0019] FIG. 3 is a flow diagram for converting raw data from its original format into a numeric format in accordance with one embodiment of the present invention.

[0020] FIG. 4 illustrates a data record suitable for use with the present invention.

[0021] FIG. 5 illustrates raw data tables suitable for use with the present invention.

[0022] FIG. 6 illustrates reference data tables, representing data formatted in accordance with an embodiment of the present invention.

[0023] FIG. 7 is a flow diagram for analyzing reference data in accordance with an embodiment of the present invention.

[0024] FIG. 8 illustrates a distilled data table, representing related data correlated in accordance with an embodiment of the present invention.

[0025] FIG. 9 illustrates an example of data clustering in a two-dimensional space.

[0026] FIG. 10 is a flow diagram for identifying duplicate data among a pair of field vectors.

[0027] FIG. 11 is a flow diagram for identifying duplicate data among a pair of field vectors in further detail.

[0028] FIG. 12 illustrates an example of identifying duplicate data among a pair of field vectors.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0029] The present invention is directed to a system and method for organizing data in a database system. The present invention is described below with respect to various exemplary embodiments, particularly with respect to various database applications. However, various features of the present invention may be extended to other areas as would be apparent. In general, the present invention may be applicable to many database applications where large amounts of seemingly unrelated data must be compiled, stored, manipulated, and/or analyzed to determine the various relationships present in the content represented by the data. More particularly, the present invention provides a method for achieving and maintaining the integrity (i.e., accuracy and correctness) of data in a database system, even when that data initially possesses a high degree of disorder. As used herein, disorder refers to data that is duplicative, erroneous, incomplete, imprecise, false or otherwise incorrect or redundant. Disorder may present itself in the database system in many ways as would be apparent.

[0030] One embodiment of the present invention is used to maintain a database associated with accounts receivable. In this embodiment, a company may collect data relating to various persons, businesses and/or accounts from one or more sources. These sources may include, for example, credit card companies, financial institutions, banks, retail, and wholesale businesses and other such sources. While each of these sources may provide data relating to various accounts, each source may provide data representing different information based on its own needs. Furthermore, this data may be organized in entirely different ways. For example, a wholesale distributor may have data corresponding to accounts receivable corresponding to business accounts. Such data may be organized by account numbers, with each data record having data fields identifying an account number, a business associated with that account number, an address of that business, and an amount owed on the account. A retail company may have data records representing similar information but based on accounts corresponding to individuals as well as businesses.

[0031] In other embodiments of the present invention, other types of sources may provide different types of data. For example, scientific institutions may provide scientific data with respect to various areas of research. Industrial companies may provide industrial data with respect to raw materials, manufacturing, production, and/or supply. Courts or other types of legal institutions may provide legal data with respect to legal status, judgments, bankruptcy, and/or liens. As would be apparent, the present invention may use data from a wide variety of sources.

[0032] In another embodiment of the present invention, a database may be maintained to implement an integrated billing and order control system. In addition to billing-type information from sources similar to those described above, this embodiment may include data records corresponding to inventory, data records corresponding to suppliers of the inventory, and data records corresponding to purchasers of the inventory. Inventory data may be organized by part numbers, with each data record having data fields identifying an internal part number, an external part number (i.e., supplier part number), a quantity on hand, a quantity expected to ship, a quantity expected to be received, a wholesale price, and a retail price. Supplier data may be organized by a supplier number; and customer data may be organized by a customer number. Data records corresponding to each of these records may include data fields identifying a part number, a part price, a quantity ordered, a ship date, and other such information.

[0033] Another embodiment of the present invention may include an enterprise storage system that consolidates corporate information from multiple, dissimilar sources and makes that information available to users on the corporate network regardless of the type of the data, the type of computer that generated the data, or the type of computer that requested the data. Still another embodiment of the present invention includes a business intelligence system that warehouses and markets information and allows that information to be processed and analyzed on-line.

[0034] The present invention enables raw data collected from different sources to be analyzed and distilled into a collection of accurate data, organized in a way that is useful for a particular application. Using the above example of an integrated billing and order control system, explained more fully below, the present invention may produce a distilled database in which related data, such as data relating to a particular supplier or customer, may be identified as such. In this example, duplicate data corresponding to the same supplier or customer may be identified and/or discarded, and erroneous data associated with the supplier or customer may be identified, analyzed, and possibly corrected.

[0035] In general, the present invention may be implemented in hardware or software, or a combination of both. Preferably, the present invention is implemented as a software program executing in a programmable processing system including a processor, a data storage system, and input and output devices. An example of such a system 100 is illustrated in FIG. 1. System 100 may include a processor 110, a memory 120, a storage device 130, and an I/O controller 140, coupled to one another by a processor bus 150. I/O controller 140 is also coupled via an I/O bus 160 to various input and output devices, such as a keyboard 170, a mouse 180, and a display 190. Other components may be included in the system 100 as would be apparent.

[0036] FIG. 2 illustrates various forms of data processed by the present invention. Raw data 210 may be collected from one or more sources, such as raw data 210A and raw data 210B. As used herein, “raw data” simply refers to data as it is received from a particular source. Additional sources of raw data 210 may be included as would be apparent. As explained below, raw data 210 from various sources is converted into a numerical format and stored in a reference database 220. Using a process referred to herein as “data dialysis,” the present invention “purifies” raw data 210 to form reference data in reference database 220. Reference database 220 includes all the information found in raw data 210 including duplicate, incomplete, inconsistent, and erroneous data.

[0037] Distilled data stored in a distilled database 230 is derived from the reference data of reference database 220. Distilled data represents the “accurate” data available from raw data 210. Distilled database 230 includes the unique information found in raw data 210. Distilled data thus represents the best information available from raw data 210.

[0038] As also explained below, the present invention further provides for using distilled database 230 to analyze and verify new data 240, which may also be used to update the reference database 220 and distilled database 230 as appropriate.

[0039] While the present invention has numerous embodiments, to clarify its description, a preferred embodiment is explained with reference to FIGS. 3-8 in a context of an integrated billing and order control system. In this embodiment, raw data 210 is a collection of data collected from various sources, such as order processing, shipping, receiving, accounts payable and accounts receivable, etc. This raw data 210 may include data records that are related but have different data fields, duplicate data records, data records having one or more erroneous data fields, etc. To address such errors, the present invention converts raw data 210 from its original formats and data structures (which may vary based on the source) into a numeric format and stores this reference data in reference database 220.

[0040] According to the present invention, the reference data is then compared and analyzed to distill the best information available. In one embodiment of the present invention, this best information may be stored as distilled data in distilled database 230. This process is now described.

[0041] Collecting Raw Data

[0042] FIG. 3 illustrates the process by which raw data 210 is converted into reference data in reference database 220 according to one embodiment of the present invention. In a step 310, raw data 210 is collected from a raw data source. As illustrated in FIG. 2, raw data 210 may include data from one or more sources such as raw data 210A and raw data 210B. As used herein, “data” refers to the physical digital representation of information, and data “content” refers to the meaning of, or information included in or represented by that data. The different records in raw data 210 may include similar types of data content. For example, in a billing context, different records in raw data 210 may all include data content relating to a particular account.

[0043] Raw data 210 will typically be received in the form of data records 400, as illustrated in FIG. 4. Each data record 400 generally includes related information, such as information for a specific individual, company, or account. Each data record 400 stores this information in one or more data fields 410. Possible data fields 410 include, for example, an account number, a last name, a first name, a company name, an account balance, etc. Each data field 410, in turn, may include one or more data elements 420 for representing information for that specific record and specific field. Data elements 420 may exist in various formats, such as alphanumeric, numeric, ASCII, and EBCDIC, or other representation as would be apparent. Raw data 210 collected from different sources may be formatted differently. Data records 400 may include different data fields 410, and the information included in data fields 410 may be represented using data elements 420 in different formats, as would also be apparent.

[0044] Examples of raw data 210 are illustrated in raw data tables 510, 520, and 530 of FIG. 5. Data records, such as data record 510-1 and data record 510-2, are illustrated as rows of raw data tables 510, 520, and 530, whereas data fields, such as data field 510-A and data field 510-B, are illustrated as columns of raw data tables 510, 520, and 530. The tables illustrated in FIG. 5 are examples of data that might be found in various embodiments of the present invention. In other embodiments, data may come from many sources and may be formatted as databases having a much larger number of data records and/or data fields, as would be apparent.

[0045] Conversion to Numeric Format

[0046] Referring to FIG. 3, in a step 320, the present invention converts raw data 210 from its original representation (which may be in alphanumeric, numeric, ASCII, EBCDIC, or other similar formats) to a numeric representation. This ensures that reference data is represented in the same manner. Thus, the reference data, including that data from different sources, may be similarly processed.

[0047] According to the present invention, raw data 210 is converted from its original representation into an appropriate numeric representation. An appropriate numeric representation uses a number system in which each possible value of data element 420 may be represented by a unique digit or value in the number system. In other words, a radix for the number system is selected such that the radix is at least as great as the number of possible values for a particular data element. For example, in a biotechnology application for detecting nucleotide sequences of Adenine (A), Guanine (G), Cytosine (C), and Thymine (T) in nucleic acids, each data element may be one of only four values: A, G, C, and T. In such an application, a radix of four for the number system may be sufficient to represent each data element as a unique number. One such number system may include the numbers A, G, C, and T. In some embodiments of the present invention, it may be desirable to use a radix at least one greater than the number of different possible values of data element 420 in order to provide a number representative of an empty field. In this case, such a number system may include the numbers A, G, C, T, and ^, where ^ is the empty field value.
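The radix selection described in this paragraph can be sketched in Python as follows. This is only an illustrative sketch: the function names `choose_radix` and `encode`, and the assignment of digit values in the order A, G, C, T, ^, are assumptions made for the example and are not taken from the specification.

```python
def choose_radix(symbols, reserve_empty=True):
    """Return the smallest radix that gives every symbol a unique digit.

    When reserve_empty is True, one extra digit is set aside for an
    empty field, as suggested in the text.
    """
    distinct = len(set(symbols))
    return distinct + 1 if reserve_empty else distinct


def encode(text, digits):
    """Interpret text as a number whose numerals are drawn from `digits`."""
    radix = len(digits)
    value = 0
    for ch in text:
        # Shift the running value one place and add the next digit.
        value = value * radix + digits.index(ch)
    return value


# Assumed digit order: A=0, G=1, C=2, T=3, and '^' (empty field) = 4.
NUCLEOTIDES = "AGCT"
WITH_EMPTY = NUCLEOTIDES + "^"

radix4 = choose_radix(NUCLEOTIDES, reserve_empty=False)  # four values
radix5 = choose_radix(NUCLEOTIDES)                       # four values plus empty
gatc_base4 = encode("GATC", NUCLEOTIDES)
gatc_base5 = encode("GATC", WITH_EMPTY)
```

Under these assumptions the same sequence maps to different integers depending on whether an empty-field numeral is reserved, so the radix has to be fixed before any data is converted.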

[0048] According to a preferred embodiment of the present invention, data elements 420 in raw data 210 are comprised of characters such as alphanumeric characters. In this preferred embodiment, a radix of 40 is selected to represent the alphanumeric characters as illustrated in the table below. (Note that a minimum radix of 36 is required.) This radix is selected to accommodate the ten numeric characters “0”-“9” and the twenty-six alphabetic characters “A” to “Z” as well as to allow for several additional characters. In this embodiment, uppercase and lowercase characters are not distinguished from one another.

[0049] As illustrated in Table 1, the base-40 number system includes the numbers 0-9, followed by A-Z, further followed by four additional numbers. One of these numbers may be used to represent an empty field. This number is used to represent a data field 410 that is empty or has no value (in contrast to a zero value). Other numbers may be used, for example, to represent other types of information such as spaces or used as control information.

TABLE 1

Alpha-Numeric    Base-10 Number    Base-40 Number
0                0                 0
1                1                 1
2                2                 2
3                3                 3
4                4                 4
5                5                 5
6                6                 6
7                7                 7
8                8                 8
9                9                 9
A or a           10                A
B or b           11                B
C or c           12                C
D or d           13                D
E or e           14                E
F or f           15                F
G or g           16                G
H or h           17                H
I or i           18                I
J or j           19                J
K or k           20                K
L or l           21                L
M or m           22                M
N or n           23                N
O or o           24                O
P or p           25                P
Q or q           26                Q
R or r           27                R
S or s           28                S
T or t           29                T
U or u           30                U
V or v           31                V
W or w           32                W
X or x           33                X
Y or y           34                Y
Z or z           35                Z
(none)           36                [
(none)           37                \
(none)           38                ]
(none)           39                ^

[0050] Representation of raw data 210 in a base-40 format has numerous benefits. One benefit is that raw data 210 may be represented in a numeric fashion, facilitating straightforward mathematical manipulation. Another benefit is that proper selection of both the radix and the numerals in the number system allows the represented content to maintain semantic significance, facilitating recognition of the content of raw data 210 in its representation in the numeric format. For example, the word “JOHN” represented by the four alphanumeric characters “J” “O” “H” “N” may be represented in various number systems. One such number system is a base-40 number system. Using Table 1, representing the alphanumeric characters “JOHN” as a base-40 number would result in the “tetradecimal” value ‘JOHN’, which is equivalent to the decimal value 1,255,103 (19*40³+24*40²+17*40¹+23*40⁰, where base-40 ‘J’ equals decimal 19, etc.). Note that the base-10 number loses semantic significance from the content of raw data 210 whereas the base-40 number retains semantic significance, as the number ‘JOHN’ is recognizable as the content “JOHN.” Semantic significance provides the benefits of a numeric representation while maintaining the ability to convey semantic content.
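As a minimal sketch of the base-40 conversion described above, the following Python fragment maps text to an integer and back using the numeral order of Table 1. The names `DIGITS`, `to_number`, and `to_text` are hypothetical and chosen only for this illustration.

```python
# Numeral order follows Table 1: 0-9, A-Z, then the four extra numerals.
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^"
RADIX = len(DIGITS)  # 40


def to_number(text):
    """Convert alphanumeric text to its base-40 integer value.

    Upper and lower case are treated alike, as in the described embodiment.
    """
    value = 0
    for ch in text.upper():
        value = value * RADIX + DIGITS.index(ch)
    return value


def to_text(value):
    """Render a non-negative integer using the base-40 numerals."""
    if value == 0:
        return DIGITS[0]
    out = []
    while value > 0:
        value, digit = divmod(value, RADIX)
        out.append(DIGITS[digit])
    return "".join(reversed(out))


john = to_number("JOHN")  # 19*40**3 + 24*40**2 + 17*40 + 23
```

Displaying `john` with `to_text` gives back the readable numerals, which is the semantic significance the paragraph refers to. One limitation of this sketch is that leading zeros are not preserved on the round trip, since they do not change the integer value.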

[0051] In some embodiments of the present invention, the selection of a radix and its corresponding number system may depend upon the number of bits used by processor 110. The number of bits used by processor 110 and the radix chosen for the number system define the number of characters that can be represented by a data word in processor 110. This relationship is governed according to the following equation:

N = B*ln(2)/ln(R),

[0052] where N is the number of whole characters (i.e., fractional characters are discarded) represented by a data word of processor 110, B is the number of bits per data word, and R is the selected radix. This relationship limits the number of data elements 420 of raw data 210 that may fit in a data word. For example, in a 32-bit machine, the maximum number of characters that may fit in a data word using a base-40 number system is six (32*ln(2)/ln(40)=6.013). The maximum number of characters that may fit in a data word using a base-41 number system is only five (32*ln(2)/ln(41)=5.973). Thus, in some embodiments of the present invention, in addition to having a radix sufficiently large to maintain semantic significance, the radix may also be selected to maximize the number of characters represented by a single data word. In the embodiment with raw data comprised of alphanumeric characters, an appropriate radix may range from 36 to 40. This range maintains semantic significance while maximizing the number of characters represented by the 32-bit data word. Other types of raw data and other sizes of data word may dictate other appropriate radix ranges in other embodiments of the present invention.
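The relationship between word size, radix, and characters per word can be checked with a short Python sketch. The function names are assumptions made for illustration. The whole-character count is computed with integer arithmetic (the largest N for which R to the power N does not exceed 2 to the power B), which is equivalent to discarding the fractional part of B*ln(2)/ln(R) but avoids floating-point rounding near exact powers.

```python
import math


def chars_per_word(bits, radix):
    """Whole characters per data word: largest n with radix**n <= 2**bits."""
    n = 0
    while radix ** (n + 1) <= 2 ** bits:
        n += 1
    return n


def chars_per_word_exact(bits, radix):
    """Unrounded value of B*ln(2)/ln(R), for comparison with the text."""
    return bits * math.log(2) / math.log(radix)


six = chars_per_word(32, 40)    # base-40 on a 32-bit word
five = chars_per_word(32, 41)   # base-41 on a 32-bit word
```

Evaluating `chars_per_word(32, r)` for r from 36 to 40 returns the same count each time, which is consistent with the radix range given in the paragraph.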

[0053] The embodiment of the present invention described above does not distinguish between uppercase and lowercase characters. However, other embodiments of the present invention may distinguish between these types of characters. Accordingly, a base-64 representation (“0”-“9”, “A”-“Z”, “a”-“z”, and two other values) may be appropriate to distinguish between these characters as would be apparent.

[0054] The number of data elements 420 in each data field 410 also dictates the precision required by the number as represented in processor 110. As described above, each data field 410 may only be six characters or data elements 420 wide for single precision operations in a 32-bit machine. In some embodiments of the present invention, this may be insufficient. In these embodiments, double, triple, or even quadruple precision may be required to represent the entire data field 410 as a single value. Double precision numbers are sufficient for up to twelve character data fields 410; triple precision numbers are sufficient for up to eighteen characters; and quadruple precision numbers are sufficient for up to twenty-four characters.

[0055] Alternate embodiments of the present invention may accommodate large data fields by breaking a large data field into one or more smaller data fields. The large data fields may be broken at boundaries defined by spaces. For example, a data field representing an address such as “123 West Main Street” may be broken into four smaller data fields: ‘123’, ‘West’, ‘Main’, and ‘Street’. The large data fields may also be broken at data word boundaries. In the address example above, the smaller data fields might be: ‘123\We’, ‘st\Mai’, ‘n\Stre’, and ‘et’, where the number ‘\’ is used to represent a space. Other embodiments of the present invention may accommodate large data fields in other manners as would be apparent.
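The two ways of breaking up a large data field can be sketched in Python as below. The function names and the constants `SPACE` and `CHARS_PER_WORD` are assumptions for illustration; the width of six characters corresponds to a base-40 number held in a 32-bit data word.

```python
SPACE = "\\"         # numeral standing in for a space, as in the address example
CHARS_PER_WORD = 6   # base-40 characters that fit in one 32-bit data word


def split_at_spaces(field):
    """Break a field into smaller fields at boundaries defined by spaces."""
    return field.split()


def split_at_words(field, width=CHARS_PER_WORD):
    """Break a field at data word boundaries, keeping spaces as a numeral."""
    packed = SPACE.join(field.split())
    return [packed[i:i + width] for i in range(0, len(packed), width)]


address = "123 West Main Street"
by_space = split_at_spaces(address)
by_word = split_at_words(address)
```

Splitting at spaces yields fields that each carry a recognizable token, whereas splitting at word boundaries yields fixed-width pieces that each fit a single data word; which is preferable would depend on how the fields are later compared.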

[0056] Data Structure Conversion

[0057] As illustrated in FIG. 3, in a step 330, raw data 210 represented as a number is stored in a predefined data structure. In one embodiment of the present invention, this data structure is a single-field table as illustrated by Tables 610-670 of FIG. 6. This data structure may vary. For example, in other embodiments of the present invention, the data structure may be a multiple-field table instead of a single-field table. In these embodiments, the data structures may be implemented with standard features such as table headers and indices, and as explained in greater detail below, may also include probability values for each record. These probability values represent the likelihood that the data in that record is complete. Higher probability values may indicate a higher probability of completeness, and lower probability values similarly may indicate a lower probability of completeness. This is described in further detail below. Initially, the probability values are set to 0. Other embodiments may also include key numbers or identification numbers to aid in sorting and in maintaining relationships among the data records.

[0058] In a preferred embodiment of the present invention, raw data 210 illustrated in FIG. 5 includes three tables 510, 520, and 530. Table 510 may represent raw data 210 from, for example, a company's accounts receivable system. Columns of table 510 represent data fields for an account number, a last name, a first initial, and additional fields for listing various orders processed for a particular individual. Rows of table 510 (such as 510-1 and 510-2) represent data records for different individuals. Tables 520 and 530 may represent raw data 210 maintained by credit card companies. Columns of tables 520 and 530 represent data fields for an account number, a last name, a first name, and an address. Rows of tables 520 and 530 represent data records for specific accounts.

[0059] In the preferred embodiment, step 330 converts raw data 210 from the format illustrated in FIG. 5 into a format illustrated in FIG. 6. FIG. 6 illustrates raw data 210, combined from the various raw data tables 510, 520, 530 of FIG. 5, represented as numbers in a base-40 number system, and formatted as new tables (tables 610-670), which together may comprise reference database 220.

[0060] Each reference database table 610-670 corresponds to an individual field from raw data tables 510, 520, and 530 of FIG. 5. More specifically, data records of reference data tables 610-670 correspond to the data records of raw data table 510, followed by the data records of raw data table 520, followed by the data records of raw data table 530. In one embodiment of the present invention, where a raw data table record has no information for a particular data field 410 represented in a reference table 610-670, an empty field value is entered in that field in the reference table. For example, the first data record 510-1 of Table 510 has no information about an address, and thus an empty field value is placed in the first position of table 670.

[0061] Data is preferably stored in reference database 220 in such a way that all data corresponding to a single data record in a raw data table is readily identified. In the embodiment represented in FIGS. 5 and 6, for example, data corresponding to any specific data record of the raw data tables (tables 510, 520, 530) is preferably represented in reference tables 610-670 as a “vector” of numeric data stored at an index i across reference tables 610-670. For example, data corresponding to the sixth record 520-6 of raw data table 520 (illustrated as account number “A60” belonging to “Jennifer Brown,” residing at “51 Fourth Street”) is represented in reference database tables 610-670 as a vector having coefficients formed from the tenth records 610-10, 620-10, 630-10, 640-10, 650-10, 660-10, and 670-10 of the tables 610-670.
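One way to picture the single-field reference tables and the record vector stored at a common index is the Python sketch below. Everything in it is hypothetical: the field names, the sample records other than the “A60” example quoted above, and the helper names `build_reference_tables` and `record_vector`. For readability the values are kept as text rather than converted to base-40 integers, and Python's zero-based indexing is used where the text counts records from one.

```python
EMPTY = "^"  # assumed empty-field numeral

# Hypothetical field names; each becomes one single-field table.
FIELDS = ["account", "last_name", "first_name", "address"]


def build_reference_tables(raw_tables, fields=FIELDS):
    """Append the records of each raw table, in order, to per-field tables.

    A field missing from a raw record is stored as the empty-field value,
    so every single-field table has the same length.
    """
    tables = {name: [] for name in fields}
    for raw in raw_tables:
        for record in raw:
            for name in fields:
                tables[name].append(record.get(name, EMPTY))
    return tables


def record_vector(tables, i):
    """Collect the entries stored at index i across all reference tables."""
    return [tables[name][i] for name in tables]


# Two small raw tables with different fields (illustrative data only).
receivables = [{"account": "A10", "last_name": "SMITH"}]
card_company = [{"account": "A60", "last_name": "BROWN",
                 "first_name": "JENNIFER", "address": "51 FOURTH STREET"}]

reference = build_reference_tables([receivables, card_company])
vector = record_vector(reference, 1)
```

Because every table is filled in the same record order, a single index is enough to reassemble a raw record, and each one-field table can still be sorted or compared on its own if the index is carried along.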

[0062] As illustrated in FIG. 6, reference database 220 includes a new table 610 that does not correspond to any data field 410 in raw data 210 illustrated in FIG. 5. This table is a “key table” that identifies the related data in these data vectors. As described below, reference database 220 comprised of the tables illustrated in FIG. 6 may include additional key tables for data fields. These may include a personal identification number (“PIDN”), an account identification number (“AIDN”), or other types of identification numbers. These key tables or identification numbers may be used to identify sets of related data vectors in reference database 220.

[0063] In this example, key table 610 has a single field “PIDN,” which stands for personal identification number. Key table 610 provides a unique identifier such that a specific PIDN never refers to more than one person represented in raw data 210. In other words, the PIDN reflects the fact that multiple records in raw data 210 may refer to the same person.

[0064] Preferably, each data record in the key table 610 initially corresponds to a different data record represented in the raw data tables 510, 520, and 530. For example, in FIG. 6, data record 610-10 in the key table 610 is implemented such that it includes identifiers (such as pointers or indices) for corresponding data in reference tables 620-670, which together correspond to a single record 520-6 in raw data table 520.

[0065] Initially, while a single PIDN does not refer to multiple individuals, a single individual may correspond to multiple PIDNs. For example, in FIG. 6, vector 4 (defined by PIDN 4) and vector 9 (defined by PIDN 9) appear to refer to the same person, but as illustrated, this person is initially assigned two PIDNs: PIDN 4 and PIDN 9. As described below, the present invention enables a determination whether PIDN 4 and PIDN 9 do, in fact, refer to the same individual, and if so, assigns a single PIDN to this individual. Alternatively, some embodiments may assign a new PIDN to individuals so determined and a reference to the old PIDN may be retained.

[0066] As discussed above, in this embodiment, records are represented in the reference database tables 610-670 as vectors having coefficients of base-40 numbers across eight one-field tables. This numeric representation allows the data to be analyzed using straightforward mathematical operations that may be used to, for example, produce correlations, calculate eigenvectors, perform various coordinate transformations, and utilize various pattern recognition analyses. These operations may, in turn, be used to provide or derive information about the records and their relationships to one another. By using small, one-field tables, these operations may be performed quickly. In addition, as will be illustrated, representation in base-40 numbers with raw data 210 including alphanumeric characters allows content of raw data 210 to retain its semantic significance.

[0067] Data Dialysis

[0068] Referring back to FIG. 2, once reference database 220 is created as illustrated in FIG. 6, a data dialysis process 700 is applied to distill the most accurate data for inclusion in distilled database 230. Data dialysis 700 is now described with reference to FIG. 7.

[0069] Partitioning the Reference Data

[0070] In a step 710, reference database 220 is preferably partitioned or sorted into sets based on some criteria. These sorting criteria may vary. For example, as illustrated in table 810 of FIG. 8, in this embodiment, data records may be sorted into sets based on last name, with the values arranged in increasing numeric order (recall that content of raw data is now represented as base-40 numbers in reference database 220). Table 810 is derived from reference database table 620 illustrated in FIG. 6, with each entry of table 810 defined by a unique last name and having a corresponding set of table 620 records matching that last name. In the representation illustrated, table 810 includes a field for defining the set (in this case, a last name), as well as identifiers for members of the set (such as indices, pointers or other appropriate references; in this case, PIDNs).
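
The partitioning of step 710 may be sketched as follows. This is a minimal illustration only, not the implementation of the embodiment: the function name `partition_by_field` and the sample data are hypothetical, and plain strings stand in for the base-40 numbers of reference database 220.

```python
from collections import defaultdict

def partition_by_field(field_values):
    """Group record identifiers (here, PIDNs) by the value of a single field.

    field_values is an iterable of (pidn, value) pairs. The result is a list
    of (value, [pidns]) entries ordered by value, loosely mirroring table 810.
    """
    sets = defaultdict(list)
    for pidn, value in field_values:
        sets[value].append(pidn)
    return sorted(sets.items())

# Hypothetical sample data: (PIDN, last name) pairs.
last_names = [(1, "ZANE"), (2, "SMITH"), (3, "LEE"), (4, "SMITH"), (10, "BROWN")]
partitions = partition_by_field(last_names)
```

Each entry of `partitions` pairs a set-defining value with the identifiers of its members, so later steps can restrict comparisons to records within the same set.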

[0071] In some embodiments of the present invention, not all vectors in reference database 220 will have data for the field on which the sets are based. Such vectors may be handled in various manners. For example, all vectors in reference database 220 having no data for that data field may be regarded as members of a single, additional set. Alternatively, each vector in reference database 220 having no data for that data field may be regarded as the single member of its own set.

[0072] Identifying Duplicate Data

[0073] Returning to FIG. 7, in a step 720, those data records within the partitioned sets identified as duplicates are marked. In some embodiments of the present invention, duplicate data may be unnecessary and may be discarded. In other embodiments, all information remains in reference database 220, as all information, even erroneous, incomplete, or duplicate information, may be better than no information and may be useful for some purpose, such as identifying fraud.

[0074] In some embodiments of the present invention, comparing a pair of vectors may identify duplicates. Various operations may be used, as would be apparent. In a simple example, a straightforward vector subtraction may be performed to measure the degree of similarity between two records. Other techniques may be used to identify duplicate vectors such as using “look-up” tables to identify common names, nicknames, abbreviations, etc.

[0075] Table 810 of FIG. 8 illustrates that the last name “Smith” corresponds to PIDNs 2, 4, 8, 9, and 11, representing vectors formed from entries 2, 4, 8, 9, and 11 of the reference database tables 610-670 illustrated in FIG. 6:

For PIDN 2: [SMITH, J, 98-002, A40, A60, ^]
For PIDN 4: [SMITH, J, 98-004, A50, B10, ^]
For PIDN 8: [SMITH, Jennifer, ^, A40, ^, 300 Pine St.]
For PIDN 9: [SMITH, John, ^, A50, ^, 37 Hunt Dr.]
For PIDN 11: [SMITH, Jhon, ^, B10, ^, 85 Belmont Ave.]

[0076] Vector (or matrix) operations comparing the vectors and thresholds for determining when two entries are similar enough to be regarded as duplicates may be defined as appropriate for various embodiments. In a simple example, the sum of the absolute differences between corresponding coefficients of a pair of vectors may indicate a similarity between the corresponding pair of records. This pair of vectors may be considered duplicates if a first vector is not inconsistent with any field of a second vector, and does not provide any additional data. In this embodiment, additional rules would also be defined, for example, for comparing entries of different lengths (e.g., right aligning character strings corresponding to numbers, and left aligning character strings corresponding to letters), for recognizing commonly misspelled words or spelling variations of words, and for recognizing transposed letters in words. This processing may be performed by various mechanisms, as would be apparent. In the example of Table 810 of FIG. 8, none of the data records are exact duplicates, and so none are marked in step 720.
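
The simple example above, a sum of absolute differences compared against a threshold, may be sketched as follows. The names `sum_abs_diff` and `is_duplicate`, the default threshold of zero, and the equal-length requirement are assumptions made for illustration; the additional rules for differing lengths, misspellings, and transposed letters are not modeled.

```python
def sum_abs_diff(v1, v2):
    """Sum of absolute differences between corresponding coefficients."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same number of coefficients")
    return sum(abs(a - b) for a, b in zip(v1, v2))

def is_duplicate(v1, v2, threshold=0):
    """Treat two vectors as duplicates when their distance is within threshold.

    A threshold of 0 (an assumed default) accepts only exact duplicates.
    """
    return sum_abs_diff(v1, v2) <= threshold
```

A smaller sum indicates greater similarity; raising `threshold` loosens the test at the cost of more false matches.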

[0077] Correlating Data

[0078] Referring back to FIG. 7, in a step 730, the preferred embodiment of the present invention correlates data records remaining within each set and in a step 740, further partitions the data records into independent subsets of data records. In general, the “correlation” between two vectors is a measurement of how closely one is related to the other, and specific methods of correlation may vary depending on the intended application. A general discussion and examples of correlation functions may be found in references such as NUMERICAL RECIPES IN C: THE ART OF SCIENTIFIC COMPUTING (Cambridge University Press, 2nd ed. 1992) by William H. Press, et al. Other techniques and examples may be found in THE ART OF COMPUTER PROGRAMMING (Addison-Wesley Pub., 1998) by Donald E. Knuth.

[0079] As an example, a simple measurement of the correlation between vectors is their dot product, which may be weighted as appropriate. Depending on the application, the dot product may be calculated on only a subset of the vector coefficients, or may be defined to compare not only corresponding coefficients, but also other pairs of coefficients determined to be in related fields (e.g., comparing a “first name” coefficient of a first vector with a “middle name” coefficient of a second vector). As with the operations for identifying duplicate data, the correlation function may be appropriately tailored for its intended application. For example, a correlation function may be defined to appropriately compare entries of different lengths and to appropriately distinguish between significant and insignificant differences, as would be apparent.
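
A weighted dot product of the kind mentioned above may be sketched as follows. The function name `weighted_dot` and the use of a zero weight to exclude a coefficient are assumptions for illustration; cross-field comparisons (such as first name against middle name) are not modeled.

```python
def weighted_dot(v1, v2, weights=None):
    """Weighted dot product of two equal-length numeric vectors.

    weights defaults to all ones (a plain dot product). A weight of 0
    excludes a coefficient, which restricts the product to a subset of fields.
    """
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same number of coefficients")
    if weights is None:
        weights = [1.0] * len(v1)
    return sum(w * a * b for w, a, b in zip(weights, v1, v2))
```

For instance, `weighted_dot([1, 2, 3], [4, 5, 6], weights=[1, 0, 1])` ignores the second coefficient of each vector.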

[0080] In the embodiment explained with reference to the tables of FIGS. 5, 6, and 8, an example of a correlation function compares vectors corresponding to the members of a set sharing the same last name to identify independent subsets of vectors. Again, this determination may be based on application-specific criteria. In this example, independent vectors may be defined to be those vectors representing different individuals.

[0081] As a result of applying the correlation function, a correlation parameter reflecting the degree of independence of a pair of vectors is assigned. For example, a high value may be assigned to indicate a high degree of similarity, and a low value may be assigned to indicate a limited degree of similarity. The correlation value is then compared to a predetermined threshold value (which, again, may vary in different applications) to determine whether the two records corresponding to those vectors are considered to be independent.

[0082] Based on the correlation values, in a step 740, the preferred embodiment partitions the data records into subsets of independent data records within each set. In the examples of FIGS. 5, 6, and Table 810 of FIG. 8, members of an independent subset may be identified as those members having: the same last name (taking into consideration misspellings and spelling variations); relatively similar first names (taking into consideration misspellings, spelling variations, nicknames, and combinations of first and middle names and initials); one or more matching account numbers; and no more than three addresses (to allow for work and home addresses, and one change of address).

[0083] Results of applying such a function are illustrated in Table 820 of FIG. 8. The individuals identified are:

[0084] Jennifer Brown, PIDN 10;

[0085] Howard Lee, PIDNs 3 and 6;

[0086] Carole Lee, PIDN 7;

[0087] Jennifer Smith, PIDNs 2 and 8;

[0088] John Smith, PIDNs 4 and 11;

[0089] John Smith, PIDN 9;

[0090] Ann Zane, PIDNs 1, 5, and 12; and

[0091] Molly Zane, PIDN 13.

[0092] Other operations for correlating the vectors are available. These may include computing dot products, cross products, lengths, direction vectors, and a plethora of other functions and algorithms used for evaluation according to well-known techniques.

[0093] FIG. 9 illustrates a two-dimensional example of a concept referred to as clustering, which is used conceptually to describe some general aspects of the present invention. In FIG. 9, four clusters exist as a collection of two-dimensional points. These clusters are identified as: (a,b), (c,d), (e,f), and (g,h). As illustrated, each cluster is formed from one or more points in the two-dimensional space. Each point corresponds to a data record that represents (with more or less accuracy) the “true” value of the cluster in the space. As illustrated, clusters (a,b) and (c,d) are fairly easy to distinguish from one another and from clusters (e,f) and (g,h). However, in this simple example, clusters (e,f) and (g,h) are not easily distinguished from one another. Extending the space (i.e., adding additional data fields to the vectors) may increase the separation between clusters such as (e,f) and (g,h) so that they become more readily distinguished from one another. Alternatively, extending the space may indicate that (g,h) is a point that belongs to cluster (e,f) or even cluster (c,d). In the abstract, the space may be extended infinitely, resulting in a Hilbert space, which has various well-known characteristics. These characteristics may be exploited by the present invention for large, albeit not infinite, vectors as would be apparent.

[0094] Furthermore, while adding additional data fields to the vectors (i.e., extending the space) may separate clusters from one another to aid in their correlation, deleting data fields from the vectors (i.e., reducing the space) may also identify some correlations. In some embodiments of the present invention, reducing the space may identify certain clusters that are in fact representing the same individual or other unique entity. For example, one record in a database may have ten data fields exactly identical to the same ten data fields in a second record in the database. These data fields may correspond to a first name, a birth date, an address, a mother's maiden name, etc. However, these two records may have two fields that are different. These two fields may correspond to a last name and a social security number. In some cases, these records may correspond to the same individual. The present invention simplifies the process for identifying these types of records that would be difficult, if not impossible, to detect using conventional methods.

[0095] Thus, removing one or more particular data fields from a vector and reducing the corresponding space may reveal clusters that otherwise would not be apparent. Doing this for data fields traditionally used for identification purposes (e.g., last name, social security number, etc.) may reveal duplicate records in databases. This may be particularly useful for identifying fraud. Removing data fields where a vector includes an empty field value for that data field may also reveal clusters that would not otherwise be apparent.

[0096] Furthermore, once the clusters are identified as representing the same individual or entity, the best information for the individual or entity may be extracted from the information provided by each record or “black dot.”

[0097] The principles of the present invention may be extended beyond simple vectors and data fields. For example, the present invention may be extended through the use of tensors representing objects in a multi-dimensional space. In this manner, the present invention may be used to represent the parameters of various physical phenomena to gain additional insight into their operation and effect. Such an application may be particularly useful for deciphering the human gene and for aiding the efforts of programs such as the Human Genome Project.

[0098] Handling Stranded Data

[0099] Referring again to FIG. 7, in a step 750, the preferred embodiment of the present invention evaluates “stranded” data records. Stranded data records are those records from reference database 220 that were not partitioned into any set in step 710. In some embodiments, reference database 220 may include a large number of tables corresponding to data fields and a large number of vectors having data for various combinations of fields. For example, in an embodiment having a reference database 220 including 20 tables for different data fields and 1000 vectors defined by related data records for each table, suppose only 800 of those 1000 vectors have data for the field “last name,” by which the sets were created in step 710. Step 710 may not partition those 200 vectors with no “last name” data into any set, or may partition each of those 200 vectors into its own set. In either case, the result is that those 200 vectors are not correlated with any others in steps 720, 730, and 740. Step 750 may evaluate those vectors.

[0100] Methods of evaluation may vary. For example, one embodiment may correlate each stranded entry with one member of each subset identified in step 740. Depending on the resulting correlation values, that vector may be added to the subset with which it is most highly correlated, or may define a new subset. Alternatively, in some embodiments, it may be determined that such evaluation is too time-consuming and step 750 may be completely skipped.

[0101] Repeating the Correlation Process

[0102] Steps 710-750 may be repeated as needed for specific embodiments. As noted above, some embodiments will have a reference database 220 having a large number of fields and a large number of entries, with many entries having data for only a subset of fields. In such a case, performing steps 710-750 on a single field is unlikely to derive all relevant information. Even in the simple example explained with reference to FIGS. 5, 6, and 8, correlating on the single field “last name” may provide only partial information about the correlation between those entries. For example, Jennifer Smith, corresponding to PIDNs 2 and 8 in FIG. 6, may be the same individual as Jennifer Brown, corresponding to PIDN 10, because PIDNs 2 and 10 may share a common account number. Performing the correlation on the last name field may not identify these PIDNs as corresponding to the same individual because they were evaluated only against other PIDNs sharing the same last name. Performing a correlation on the account number field may provide additional information about whether these PIDNs are related.

[0103] Thus, correlation across various data fields may be necessary to fully evaluate the degree of relatedness of the data in reference database 220.

[0104] Using Correlation Results to Update Reference Data

[0105] Once steps 710-760 are completed, reference database 220 has been distilled into a distilled database 230, as illustrated in FIG. 2. In some embodiments of the present invention, these two databases are handled separately and coexist with one another. In other embodiments of the present invention, a single database exists with records marked or otherwise identified as belonging to reference database 220 or distilled database 230. This may be accomplished by using different ranges of PIDNs for the records in the two databases. Furthermore, relationships between records in the two databases may be maintained by adding a constant value to the PIDN for the record in reference database 220 to generate a PIDN for the record in distilled database 230. For example, a record with a PIDN of 12345 in reference database 220 may have a PIDN of 9012345 in distilled database 230. In this manner, the two databases may be treated as distinct portions of a single database.
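
The constant-offset relationship in the example above may be sketched as follows. The offset of 9,000,000 is inferred from the example values (12345 and 9012345); the function names, and the assumption that every reference PIDN is smaller than the offset, are introduced here for illustration only.

```python
# Offset inferred from the example: 12345 + 9_000_000 == 9012345.
DISTILLED_OFFSET = 9_000_000

def to_distilled_pidn(reference_pidn):
    """Map a reference-database PIDN to its distilled-database PIDN."""
    return reference_pidn + DISTILLED_OFFSET

def to_reference_pidn(distilled_pidn):
    """Recover the reference-database PIDN from a distilled-database PIDN."""
    return distilled_pidn - DISTILLED_OFFSET

def is_distilled(pidn):
    """Assumes all reference PIDNs are below the offset, so the ranges are disjoint."""
    return pidn >= DISTILLED_OFFSET
```

Because the two ranges do not overlap under this assumption, a single PIDN value indicates both which portion of the database a record belongs to and which record it is related to in the other portion.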

[0106] Using the Distilled Data

[0107] Once data dialysis process 700 is complete, distilled database 230 identifies subsets of data records from the reference database 220 as related records, and as noted above, probabilities may be determined for fields in the reference database 220 to provide a qualitative measure of their completeness. This may be accomplished by assigning a probability of completeness to each of the individual data fields and then using them to compute an overall probability of completeness for the data record. For example, for a data field representing a first name, a value of ‘J’ may be assigned a low probability (e.g., 0 or 0.1), a value of ‘JOHN’ may be assigned a higher probability (e.g., 0.7 or 0.8), and a value of ‘JONATHAN’ may be assigned the highest probability (e.g., 0.9 or 1.0). These values may be assigned somewhat arbitrarily. However, these values help identify which data fields in the set are most likely to include the most complete information or, in other words, the most probable data.
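
The completeness probabilities described above may be sketched as follows. The text does not state how the per-field values are combined into an overall value, so the product used here is only one assumed choice (it treats the fields as independent); the function name and the sample scores are likewise hypothetical.

```python
import math

def record_completeness(field_probabilities):
    """Combine per-field completeness probabilities into one record-level value.

    Assumption: the overall value is the product of the per-field values.
    Other combinations (e.g., an average) would serve equally well as a sketch.
    """
    return math.prod(field_probabilities)

# Hypothetical scores for candidate first-name values, following the example ranges.
first_name_scores = {"J": 0.1, "JOHN": 0.8, "JONATHAN": 1.0}

# The candidate most likely to carry the most complete information.
most_complete = max(first_name_scores, key=first_name_scores.get)
```

Selecting the highest-scoring candidate per field is one way the "most probable data" could be chosen for inclusion in distilled database 230.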

[0108] Use of the present invention may determine a significant amount of information about the records and their relationship to each other, and may be specifically tailored for particular applications. Furthermore, using standard database operations, distilled database 230 (which references records of the reference database 220) may be manipulated to provide formatted reports as needed. For example, an embodiment may be tailored to generate a report listing subsets of related records, with records of a subset providing information about a specific individual or entity. The records within such a subset may provide information, for example, about different fields of information; aliases and/or variations of names, addresses, social security numbers, etc., used by the individual; and fields (such as occupation, address, and account numbers) for which that individual may have more than one entry.

[0109] Recalling that all data is represented in numerical base-40 format, the subsets may be ordered numerically in the report. The base-40 format provides the additional advantage of representing alphabetical characters as their respective letters (as illustrated in the conversion table above). Thus, while the report will show entries in numerical representation, that representation retains the semantic significance of the data it represents, allowing the data to be manually read and analyzed. For example, if the report shows records for an individual having entries for names including J SMITH, JOHN SMITH, JOHN G SMITH, G SMITH, and GERALD SMITH, a person reading that report would understand that this individual uses various first names, including his first name or initial, his middle name or initial, or some combination thereof.

[0110] Adding New Data

[0111] As with conventional database applications, new data may be added from time to time. As illustrated in FIG. 2, the present invention accounts for adding new (or changed) data 240, which will affect reference database 220 and distilled database 230.

[0112] Generally, new data records 240 may be formatted as described with reference to FIG. 3, and entered into the existing reference database 220. Additionally, new data records 240 may be measured against distilled database 230 to determine if new information or content is available in new data record 240. For example, a new data record 240 may be correlated with data records from distilled database 230 to determine whether that new data record 240 is related to any data records already present in distilled database 230. If so, and new data record 240 contains information or content not already present in distilled database 230, new data record 240 may be used to update distilled database 230. For example, if new data record 240 included information for an individual named John Smith that corresponds to data records already present in distilled database 230 but provided the additional information that Mr. Smith's middle name was Greg, that additional information may be appropriately added to distilled database 230.

[0113] Changes to data records in reference database 220 and distilled database 230 may be handled using standard database protection operations, as described in references such as C. J. DATE, INTRODUCTION TO DATABASE SYSTEMS (Addison Wesley, 6th ed. 1994) (see specifically, Part IV), referenced above. For example, in the case that changes are made to reference database 220 by an authorized database administrator, related data records in reference database 220 are updated as determined by standard relational definitions and where appropriate, in accordance with relations defined in distilled database 230.

[0114] Identifying Duplicate Data Between Field Vectors

[0115] One problem associated with conventional databases is a difficulty in merging records from a first database, such as raw data 210A, with those from a second database, such as raw data 210B. Records in these databases having shared or duplicate data need to be identified so that the content included therein may be merged as a single record in a database such as reference database 220 or distilled database 230. For example, both databases 210 may include one or more entries for JOHN SMITH. If the respective records in the databases 210 represent the same individual John Smith, then the content of each of the records should be merged as a single record in, for example, distilled database 230.

[0116] Conventional brute force methods for identifying such duplicate data in these databases involve comparing a data record from the first database with every data record in the second database, and repeating this process for each record in the first database. This process is time-consuming and computationally intensive. In fact, the number of computations is geometrically related to the number of records in each of the two databases.

[0117] One process for reducing the time and number of computations required to identify the duplicate data in the databases 210 is described below with reference to FIGS. 10-12. In the process described below, a particular field common or similar among the databases is selected, for example a name field or an address field. This field is arranged as a table or an array for each of the databases that includes the value of the selected field for each of the records. For example, as discussed above, each table 610-670 represents a particular field of each of the data records in a database. For purposes of this discussion, these tables are referred to as field vectors.

[0118] According to the present invention, each of the field vectors is sorted in numerical order, and if necessary, partitioned into sets of identical data as described above with respect to FIGS. 7 and 8. For example, multiple records associated with JOHN SMITH would be partitioned together within the field vector. Preferably, information regarding the location of the partitions between the sets is stored.
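
Sorting a field vector and recording where each partitioned set begins may be sketched as follows. The function name `sort_and_partition` and the choice to store partition locations as a list of start indices are assumptions for illustration; small integers stand in for base-40 values.

```python
from itertools import groupby

def sort_and_partition(field_vector):
    """Sort a field vector and record the start index of each run of equal values.

    Returns (ordered, starts), where starts[k] is the index in ordered at
    which the k-th partitioned set begins.
    """
    ordered = sorted(field_vector)
    starts = []
    position = 0
    for _, run in groupby(ordered):
        starts.append(position)
        position += len(list(run))
    return ordered, starts
```

For example, `sort_and_partition([8, 3, 8, 5, 3])` yields the ordered vector `[3, 3, 5, 8, 8]` with sets starting at indices 0, 2, and 3.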

[0119] Once the field vectors are sorted and partitioned, a value of the first element of a first field vector is compared with a value of the first element of a second field vector. Essentially, if the value in the first field vector is greater than the value in the second field vector, an index into the second field vector is advanced or otherwise adjusted to a position within the next partitioned set to obtain a next value in the second field vector. This next value in the second field vector is then compared to the value in the first field vector. This continues as long as the value in the first field vector is greater than the value in the second field vector.

[0120] On the other hand, if the value of the first field vector is less than the value of the second field vector, an index into the first field vector is advanced or otherwise adjusted to a position within the next partitioned set to obtain a next value in the first field vector. This next value in the first field vector is then compared to the value in the second field vector. This continues as long as the value in the first field vector is less than the value in the second field vector.

[0121] When the value of the first field vector equals the value in the second field vector, the process has identified duplicate data that is then preferably stored in a common field vector. After storing the identified duplicate data, the index into the first field vector and the index into the second field vector are both advanced or otherwise adjusted to a position within the next partitioned set of their respective field vectors.

[0122] The process thus described may be viewed as a feedback control mechanism that adjusts the index into either of the arrays based on the difference between the values in the field vectors. In the embodiment described above, a positive difference generates an adjustment to the index of the second field vector whereas a negative difference generates an adjustment to the index of the first field vector. This process results in a linear relationship between the number of values in the field vectors and the number of computations (i.e., comparisons) required as opposed to the geometric relationship associated with conventional methods.
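
The feedback view may be sketched by steering the indices with the signed difference and counting comparisons. This is a simplified illustration under stated assumptions: both inputs are already sorted in increasing order, each index advances by one element rather than by a partitioned set, and the names `feedback_match` and `comparisons` are hypothetical.

```python
def feedback_match(fv1, fv2):
    """Find common values in two increasing-sorted vectors, counting comparisons.

    The sign of the difference fv1[i] - fv2[j] selects which index to adjust:
    positive advances j, negative advances i, zero records a match and
    advances both.
    """
    i = j = comparisons = 0
    common = []
    while i < len(fv1) and j < len(fv2):
        difference = fv1[i] - fv2[j]
        comparisons += 1
        if difference > 0:
            j += 1
        elif difference < 0:
            i += 1
        else:
            common.append(fv1[i])
            i += 1
            j += 1
    return common, comparisons
```

Every comparison advances at least one index, so the loop performs at most len(fv1) + len(fv2) - 1 comparisons, whereas comparing every pair of elements requires len(fv1) × len(fv2) comparisons.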

[0123] The present invention may be extended to sorting mechanisms as well. In cases where a particular value must be inserted into a field vector (i.e., a record must be inserted into a database) based on an ordering of the values in the vector (e.g., alphabetically, numerically, etc.), a difference between the particular value and a value of one of the elements in the vector is computed. This difference is “fed back” to adjust the index into the vector to generate the next value from the vector. Using well-established methods of control theory, the index adjustments may be integrated to determine the proper location of the value to be inserted. In addition to the integrator, a proportional gain may be applied to the difference to establish a desired system performance as would be apparent.

[0124] The present invention is now described with reference to FIGS. 10-12. FIG. 10 is a flow diagram for identifying duplicate data within a pair of field vectors. The field vectors may be from a single source such as raw data 210A (e.g., when comparing a Residential Address Field with a Mailing Address Field in a single database) or from multiple sources such as raw data 210A and raw data 210B (e.g., when comparing a Name Field between two databases).

[0125] For purposes of this description, the pair of field vectors are referred to as a first field vector (“FV1”) and a second field vector (“FV2”), respectively. Preferably, the data in these field vectors are base-40 numbers that represent alphanumeric data as described above. However, in some embodiments of the present invention, the data may exist in other forms as well.

[0126] In a step 1010, the first field vector is sorted in numerical order. In a step 1020, the second field vector is also sorted in numerical order. In one embodiment of the present invention, the vectors are sorted in increasing numerical order, although other embodiments of the present invention may sort the vectors in decreasing order as would be apparent.

[0127] In a step 1030, partitioned sets within the first field vector having common values are identified. Likewise, in a step 1040, partitioned sets within the second field vector having common values are also identified. Steps 1010-1040 perform a similar function to the step of partitioning reference database 220 described above with reference to FIGS. 7 and 8. In some embodiments of the present invention, the field vectors may not include any partitioned sets as the common values within each field vector may have been eliminated. However, in a preferred embodiment of the present invention, the common values within a particular field vector are maintained.

[0128] In a step 1050, a common value vector that identifies the common values between the first and second field vectors is determined, preferably using the partitioned sets. Step 1050 is described in further detail with reference to FIG. 11.

[0129] FIG. 11 is a flow diagram for identifying common values between a pair of field vectors. In a step 1110, three vector indices are initialized. A first vector index, I, is an index into the first field vector FV1; a second vector index, J, is an index into the second field vector FV2; and a third vector index, K, is an index into the common value vector (“CV”). As mentioned above, the common value vector includes the values shared by both first and second field vectors. Indices I and J are initialized to locate a first position in each of the first and second field vectors, respectively. Index K is initialized to locate a position for a next common value to be included in the common value vector.

[0130] In a decision step 1120, the present invention determines whether the value in the I-th position of the first field vector is greater than or equal to the value of the J-th position of the second field vector. If so, processing continues at a decision step 1130; otherwise, processing continues at a step 1170. Step 1170 is performed, effectively, when the value in the I-th position of the first field vector is less than the value of the J-th position of the second field vector. In step 1170, the first index I is adjusted to locate the beginning of the next partitioned set in the first field vector. After step 1170, processing continues at a decision step 1160.

[0131] In decision step 1130, the present invention determines whether the value in the I-th position of the first field vector is equal to the value of the J-th position of the second field vector. If so, processing continues at a step 1140; otherwise, processing continues at a step 1180. Step 1180 is performed, effectively, when the value in the I-th position of the first field vector is greater than the value of the J-th position of the second field vector. In step 1180, the second index J is adjusted to locate the beginning of the next partitioned set in the second field vector. After step 1180, processing continues at decision step 1160.

[0132] Step 1140 is performed, effectively, when the value in the I-th position of the first field vector is equal to the value of the J-th position of the second field vector. In step 1140, the value included in both the first and second field vectors is placed in the common value vector.

[0133] In a step 1150, the third index K is incremented to locate the position in the common value vector of the next common value to be identified. The first index I is adjusted to locate the beginning of the next partitioned set in the first field vector. The second index J is adjusted to locate the beginning of the next partitioned set in the second field vector.

[0134] In decision step 1160, the present invention determines whether additional partitioned sets exist in both the first field vector and the second field vector. If so, processing continues at step 1120. If no partitioned sets remain in either the first field vector or the second field vector, processing ends. When processing ends, the common value vector includes all the duplicate data identified between the first and second field vectors.
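
The flow of steps 1110 through 1180 can be sketched in a few lines of Python. This is a minimal illustration only, assuming both field vectors are already sorted in increasing order; the names `next_partition` and `common_values` are hypothetical and do not appear in the specification, and the index K is left implicit in the list append. Unlike FIG. 11, the sketch tests for remaining partitioned sets before the first comparison so that empty vectors are handled.

```python
def next_partition(vec, idx):
    """Return the index just past the run of equal values that starts at idx."""
    value = vec[idx]
    while idx < len(vec) and vec[idx] == value:
        idx += 1
    return idx


def common_values(fv1, fv2):
    """Return the values shared by two field vectors sorted in increasing order."""
    i, j = 0, 0                      # step 1110: indices I and J
    cv = []                          # common value vector (index K is implicit)
    while i < len(fv1) and j < len(fv2):       # step 1160: sets remain in both
        if fv1[i] >= fv2[j]:                   # decision step 1120
            if fv1[i] == fv2[j]:               # decision step 1130
                cv.append(fv1[i])              # step 1140
                i = next_partition(fv1, i)     # step 1150
                j = next_partition(fv2, j)
            else:
                j = next_partition(fv2, j)     # step 1180: FV1 value is greater
        else:
            i = next_partition(fv1, i)         # step 1170: FV1 value is less
    return cv
```

For instance, calling `common_values([8, 8, 9, 10, 12], [8, 9, 9, 12, 18])` on these assumed inputs yields the three shared values 8, 9 and 12, each reported once regardless of how often it repeats in either vector.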

[0135] FIG. 12 illustrates an example of identifying duplicate data between field vectors according to the present invention. Steps 1010 and 1030 sort and partition field vector 1 (“FV1”) and steps 1020 and 1040 sort and partition a field vector 2 (“FV2”). The operation of step 1050 is now described with reference to steps 1110-1180 where traversal through steps 1120 to step 1160 and back to step 1120 is referred to as a “loop.”

[0136] In a first loop, the first element (i.e., 0-th position) of FV1 is compared with the first element of FV2. (This is illustrated in FIG. 12 as a line between FV1 and FV2 having arrows on both ends and annotated with 1.) In this example, a value ‘8’ of FV1 is compared with a value ‘8’ of FV2. Decision steps 1120 and 1130 determine that these values are equal and, in step 1140, the value ‘8’ is placed in the common value vector. (This is illustrated in FIG. 12 as a line between FV2 and the COMMON VALUE VECTOR having arrows on both ends and annotated with 1′.) Step 1150 adjusts the indices of both field vectors to point at the next partitioned set. Decision step 1160 determines that more partitioned sets exist in both field vectors and a second loop is started.

[0137] In the second loop, the next element of FV1 is compared with the next element of FV2. In this example, a value ‘9’ of FV1 is compared with a value ‘9’ of FV2. These values are again determined to be equal and the value ‘9’ is placed in the common value vector. As before, step 1150 adjusts both indices to point at the next partitioned sets in their respective field vectors. Decision step 1160 determines that more partitioned sets exist in both field vectors and a third loop is started.

[0138] In the third loop, the next element of FV1 is compared with the next element of FV2. In this example, a value ‘10’ of FV1 is compared with a value ‘12’ of FV2. Decision step 1120 determines that the value in FV1 is not greater than or equal to the value in FV2 and, in step 1170, the index to FV1 is adjusted to point at the next partitioned set therein. Decision step 1160 determines that more partitioned sets exist in both field vectors and a fourth loop is started.

[0139] In the fourth loop, the next element of FV1 is compared with the previous value of FV2. In this example, a value ‘12’ of FV1 is compared with the previously compared value of ‘12’ of FV2. Decision steps 1120 and 1130 determine that the values are equal, and in step 1140, the value ‘12’ is placed in the common value vector. Step 1150 adjusts both indices to point at the next partitioned sets in their respective field vectors. Decision step 1160 determines that more partitioned sets exist in both field vectors and a fifth loop is started.

[0140] In the fifth loop, the next element of FV1 is compared with the next value of FV2. In this example, a value ‘15’ of FV1 is compared with a value ‘18’ of FV2. Decision step 1120 determines that the value in FV1 is not greater than or equal to the value in FV2 and, in step 1170, the index to FV1 is adjusted to point at the next partitioned set therein. Because no more partitioned sets exist in FV1, processing ends.

[0141] In this example, five loops with a maximum of two comparisons per loop are required to identify three common values between the two field vectors. In a brute force method, 132 comparisons (12*11) are required.
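
The loop and comparison counts can be checked with a short Python sketch. The two vectors below are hypothetical: they are chosen only to have 12 and 11 elements and to contain the partitioned sets named in the walkthrough (8, 9, 10, 12, 15 in FV1 and 8, 9, 12, 18 in FV2), and are not a reproduction of FIG. 12. The function `trace_common_values` is likewise an assumed name. Only the value comparisons of decision steps 1120 and 1130 are counted; the work of sorting and partitioning is not.

```python
def next_partition(vec, idx):
    """Return the index just past the run of equal values that starts at idx."""
    value = vec[idx]
    while idx < len(vec) and vec[idx] == value:
        idx += 1
    return idx


def trace_common_values(fv1, fv2):
    """Return (common values, loops, comparisons) for two sorted field vectors."""
    i = j = loops = comparisons = 0
    cv = []
    while i < len(fv1) and j < len(fv2):
        loops += 1
        comparisons += 1                 # decision step 1120
        if fv1[i] >= fv2[j]:
            comparisons += 1             # decision step 1130
            if fv1[i] == fv2[j]:
                cv.append(fv1[i])
                i = next_partition(fv1, i)
                j = next_partition(fv2, j)
            else:
                j = next_partition(fv2, j)
        else:
            i = next_partition(fv1, i)
    return cv, loops, comparisons


# Hypothetical sorted field vectors of 12 and 11 elements.
FV1 = [8, 8, 8, 9, 9, 10, 10, 12, 12, 12, 15, 15]
FV2 = [8, 8, 9, 9, 9, 12, 12, 18, 18, 20, 20]

common, loops, comparisons = trace_common_values(FV1, FV2)
print(common, loops, comparisons)   # [8, 9, 12] 5 8
print(len(FV1) * len(FV2))          # 132 pairwise comparisons by brute force
```

With these assumed inputs the sketch takes five loops and eight value comparisons, against 132 for an exhaustive pairwise comparison.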

[0142] Various embodiments of the present invention may be used for many different applications, some of which have been described and/or alluded to above. For example, in the application described above, the invention may be used to combine billing information collected from multiple sources to derive a distilled database in which related data records are recognized and duplicate and erroneous data records are eliminated. As suggested, this may be particularly useful in cases, for example, involving fraud. Typically, persons committing credit card fraud or other forms of retail fraud make minor changes to certain pieces of their personal information while leaving the majority of it the same. For example, oftentimes, digits in a social security number may be transposed or an alias may be used. Often, however, other information such as the person's address, date of birth, mother's maiden name, etc., is used identically. These types of fraud are readily identified by the present invention, even though they are difficult to identify by human analyses.

[0143] Other possible applications include uses in telemarketing, to compile a list of targeted individuals or addresses, or in mail-order catalogs, to reduce a number of catalogs sent to the same individual or family. Still another potential application is in the medical research or diagnostics fields, in which nucleotide sequences of Adenine (A), Guanine (G), Cytosine (C), and Thymine (T) in nucleic acids may be identified.

[0144] In other embodiments, the present invention may be used as a gatekeeper for a particular database at the outset to maintain integrity of the database from the very beginning, rather than achieving integrity in the database at a later date. In these embodiments, no raw data 210 is present and only new data 240 exists. Before new data 240 is added to the database, it is measured against distilled database 230 to determine whether new data 240 includes additional information or content. If so, only that new information or content is added to distilled database 230 by updating an existing record in distilled database 230 to reflect the new information or content as would be apparent.
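
One way to picture the gatekeeper behavior is the Python sketch below. It rests on assumptions the specification does not state: the distilled database is modeled as a dictionary of records keyed by a record identifier, a matching record is found by that key alone, and "additional information" is taken to mean fields the existing record lacks. The function name `gatekeep` and the field names are hypothetical.

```python
def gatekeep(distilled, key, new_record):
    """Add to the distilled store only the fields the existing record lacks.

    Returns a dict of the fields that were actually added (empty if the new
    data carried no additional content).
    """
    existing = distilled.setdefault(key, {})
    added = {
        field: value
        for field, value in new_record.items()
        if value is not None and field not in existing
    }
    existing.update(added)      # update the existing record in place
    return added


# Hypothetical distilled database holding one record.
distilled = {"A1": {"name": "J. Smith"}}

# New data repeats the name and supplies one additional field.
added = gatekeep(distilled, "A1", {"name": "J. Smith", "dob": "1970-01-01"})
print(added)      # {'dob': '1970-01-01'}
print(distilled)  # {'A1': {'name': 'J. Smith', 'dob': '1970-01-01'}}
```

Submitting the same new data a second time adds nothing, which is the property that keeps duplicates out of the store from the outset. A fuller treatment would also need a policy for new data that conflicts with stored values, which this sketch simply ignores.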

[0145] While this invention has been described in a preferred embodiment, other embodiments and variations are within the scope of the following claims. For example, formatting process 300 may format data using different radices or other character sets, and may use various data structures. The data structures may represent multiple fields, and depending on the application, will represent a variety of fields. For example, in a credit application, fields may include an account status, an account number, and a legal status, in addition to personal information about the account holder. In a medical diagnostic application, fields may include various alleles or other genetic characteristics detected in tissue samples.

What is claimed is:
 1. A method for identifying duplicate data between a first field vector and a second field vector comprising: sorting the first field vector in a particular order; sorting the second field vector in said particular order; comparing a first value at a first index in the first field vector with a second value at a second index in the second field vector; if said first value is not equal to said second value, adjusting either said first index or said second index based on a difference between said first value and said second value; and if said first value is equal to said second value, determining said first and second values as duplicate data.
 2. The method of claim 1, wherein said sorting the first field vector in a particular order comprises sorting the first field vector in an increasing order, and wherein said sorting the second field vector in said particular order comprises sorting the second field vector in said increasing order.
 3. The method of claim 1, wherein said sorting the first field vector in a particular order comprises sorting the first field vector in a decreasing order, and wherein said sorting the second field vector in said particular order comprises sorting the second field vector in said decreasing order.
 4. The method of claim 2, wherein said adjusting either said first index or said second index comprises adjusting said first index if said first value is less than said second value.
 5. The method of claim 2, wherein said adjusting either said first index or said second index comprises adjusting said second index if said second value is less than said first value.
 6. A method for identifying duplicate data between a first field vector and a second field vector comprising: sorting the first field vector in a particular order; sorting the second field vector in said particular order; comparing a first value at a first index in the first field vector with a second value at a second index in the second field vector; if said first value is not equal to said second value, adjusting one of said first index and said second index based on a difference between said first value and said second value; and if said first value is equal to said second value, determining said first and second values as duplicate data, wherein said sorting the first field vector in a particular order comprises sorting the first field vector in an increasing order, and wherein said sorting the second field vector in said particular order comprises sorting the second field vector in said increasing order, and wherein said adjusting one of said first index and said second index comprises: adjusting said first index if said first value is less than said second value, and adjusting said second index if said second value is less than said first value.
 7. The method of claim 2, wherein said adjusting either said first index or said second index comprises incrementing either said first index or said second index based on whether said first value is greater than said second value.
 8. The method of claim 3, wherein said adjusting either said first index or said second index comprises decrementing either said first index or said second index based on whether said first value is greater than said second value.
 9. The method of claim 1, wherein said first value is a numeric value, and wherein said second value is a numeric value.
 10. The method of claim 9, wherein said first value is a numeric value that represents an alphanumeric value, and wherein said second value is a numeric value that represents an alphanumeric value.
 11. The method of claim 1, further comprising: partitioning said first field vector into at least one set of common values; and partitioning said second field vector into at least one set of common values.
 12. The method of claim 11, wherein said adjusting either said first index or said second index comprises adjusting either said first index or said second index to a next partitioned set in a respective one of said first field or said second field vector.
 13. The method of claim 2, wherein said adjusting either said first index or said second index comprises: adjusting said first index if said first value is less than said second value; and adjusting said second index if said second value is less than said first value.
 14. The method of claim 3, wherein said adjusting either said first index or said second index comprises: adjusting said first index if said first value is greater than said second value; and adjusting said second index if said second value is greater than said first value.
 15. A method for identifying duplicate data between a first field vector and a second field vector, the first field vector and the second field vector sorted in a particular order, the method comprising: partitioning said first field vector into sets of common values; partitioning said second field vector into sets of common values; comparing a first value in a first position in the first field vector with a second value at a second position in the second field vector; if said first value is not equal to said second value, adjusting either said first position or said second position based on a difference between said first value and said second value; and if said first value is equal to said second value, determining said first and second values as duplicate data.
 16. The method of claim 15, wherein said adjusting either said first position or said second position comprises adjusting either said first position or said second position to a next partitioned set of a respective one of said first field vector or said second field vector.
 17. The method of claim 16, wherein the first and second field vectors are sorted in increasing numeric order and wherein said adjusting either said first position or said second position comprises: adjusting said first position to a next partitioned set in said first field vector if said first value is less than said second value; and adjusting said second position to a next partitioned set in said second field vector if said second value is less than said first value.
 18. The method of claim 16, wherein the first and second field vectors are sorted in decreasing numeric order and wherein said adjusting either said first position or said second position comprises: adjusting said first position to a next partitioned set in said first field vector if said first value is greater than said second value; and adjusting said second position to a next partitioned set in said second field vector if said second value is greater than said first value.
 19. A method for sorting data comprising: receiving a value to be sorted; determining a first position in a vector where said value is to be included; retrieving a vector value from said vector at said first position; feeding back said vector value to determine a difference between said value and said vector value; and determining a new position in said vector based at least in part on said difference.
 20. The method of claim 19, wherein said determining a new position comprises determining a new position in said vector based at least in part on said first position.